Note: This page's design, presentation and content have been created and enhanced using Claude (Anthropic's AI assistant) to improve visual quality and educational experience.
Week 8 • Sub-Lesson 1

👁 What Multimodal AI Can See, Hear, and Read

From text-only tools to models that see images, read documents, transcribe audio, and process hours of video — and what this means for research

What We'll Cover

The assumption that AI is a text tool is already outdated. Modern models can see images, read documents, transcribe audio, and process hours of video. This week is about understanding what these capabilities genuinely offer researchers — and where the gap between “impressive demo” and “reliable research tool” remains significant.

A key framing distinction sits at the centre of everything this week: AI reading vs. AI understanding. These are not the same thing. A model can describe a chart fluently and still get the numbers wrong. It can transcribe speech accurately and still hallucinate sentences that were never said. It can extract text from a PDF and still misread the structure of a complex table.

Each sub-lesson in Week 8 examines this distinction through a different modality. This overview sets up the landscape so you know where you are going — and why the distinction matters before you pick up any multimodal tool.

🌏 The Four Modalities

Multimodal AI refers to models that accept inputs beyond plain text. Four modalities are most relevant to research workflows in 2026. Each opens different tasks — and carries different reliability profiles.

📷 Images

Scientific figures, microscopy, satellite imagery, photographs, charts, and diagrams. Images were the first non-text modality to reach general-purpose LLMs and remain the most widely tested.

  • Describing and captioning figures for accessibility
  • Qualitative interpretation of charts and graphs
  • Comparing images for obvious visual differences
  • Satellite and geospatial imagery analysis
  • Microscopy documentation and description

Representative tools: Claude (family), GPT (family), Gemini (family)
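
As a concrete starting point, the sketch below sends one figure to a vision-capable model and asks for draft alt text. It is a minimal illustration rather than a recommended pipeline: the file name and model string are placeholders, and the call follows the Anthropic Python SDK's messages API, so check current model names before running. Whatever comes back still needs the verification habits discussed later this week.

```python
# Minimal sketch: drafting alt text for a figure with a vision-capable model.
# Assumes the Anthropic Python SDK; the model name below is a placeholder,
# so check current model names in the Anthropic docs before running.
import base64
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

with open("figure_3.png", "rb") as f:  # placeholder file name
    image_data = base64.standard_b64encode(f.read()).decode("utf-8")

message = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder model name
    max_tokens=500,
    messages=[{
        "role": "user",
        "content": [
            {
                "type": "image",
                "source": {"type": "base64", "media_type": "image/png", "data": image_data},
            },
            {
                "type": "text",
                "text": (
                    "Draft one-paragraph alt text for this figure. Describe only what is "
                    "visually present, and flag any values you cannot read with certainty."
                ),
            },
        ],
    }],
)

print(message.content[0].text)
```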

🎤 Audio

Interviews, focus groups, field recordings, lectures, oral histories, podcasts, and environmental sound. Audio capabilities range from transcription-only to full end-to-end audio reasoning.

  • Transcription of interviews and focus groups
  • Speaker diarisation (who said what)
  • Transcription in African and low-resource languages
  • Thematic analysis of spoken content
  • Field recording documentation

Representative tools: Whisper large-v3, GPT (family), Gemini (family), Intron Sahara
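
For the transcription tasks above, the open-source route is the openai-whisper package. The sketch below is a minimal local transcription run; the file name is a placeholder, and note that Whisper on its own does not do speaker diarisation, which requires a separate tool layered on top.

```python
# Minimal sketch: local transcription with Whisper large-v3 via the open-source
# openai-whisper package (pip install openai-whisper). File name is a placeholder.
import whisper

model = whisper.load_model("large-v3")         # downloads model weights on first run
result = model.transcribe("interview_01.wav")  # language is auto-detected by default

print(result["text"])                          # full transcript as a single string

# Time-stamped segments are useful for spot-checking against the recording.
for segment in result["segments"]:
    print(f'[{segment["start"]:7.1f}s] {segment["text"]}')
```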

📄 Documents

PDFs, scanned papers, tables, forms, supplementary data files, archival documents. Document understanding combines OCR with structural reasoning — extracting not just text but how that text is organised.

  • Table extraction from PDFs and supplementary files
  • OCR of scanned archival documents
  • Summarisation of long research papers
  • Structured data extraction from forms
  • Cross-document comparison and synthesis

Representative tools: Claude (family), Docling, LlamaParse, Azure Document Intelligence
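
For table extraction specifically, a minimal Docling run looks roughly like the sketch below. It follows the Docling quickstart pattern (DocumentConverter plus Markdown export); the file name is a placeholder, and extracted tables should always be spot-checked against the original PDF.

```python
# Minimal sketch: converting a PDF to structured text and tables with Docling
# (pip install docling). File name is a placeholder; table output should be
# spot-checked against the original document.
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("supplementary_tables.pdf")

# Markdown export keeps headings, lists, and tables where Docling detects them.
markdown = result.document.export_to_markdown()
print(markdown[:2000])

# Detected tables can also be inspected individually as pandas DataFrames.
for i, table in enumerate(result.document.tables):
    df = table.export_to_dataframe()
    print(f"Table {i}: {df.shape[0]} rows x {df.shape[1]} columns")
```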

🎥 Video

Lectures, experiments, classroom observations, field interviews, documentary footage, and recorded procedures. Video is the most demanding modality — combining visual, audio, and temporal reasoning.

  • Classroom observation analysis
  • Summarising recorded lectures or seminars
  • Experiment documentation from video recordings
  • Interview transcription with visual context
  • Long-form documentary or archival footage analysis

Representative tools: Gemini Pro tier (~1 hr standard, longer at low FPS), GPT (family, frames + audio)
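
For long recordings, the typical route is the Gemini API via the google-genai SDK. The sketch below is a minimal example under that assumption; the model name and file name are placeholders, and upload limits, frame-sampling rates, and supported durations change often enough that the current API docs are the authority.

```python
# Minimal sketch: summarising a recorded lecture with the Gemini API via the
# google-genai SDK (pip install google-genai). Model and file names are
# placeholders; check current limits and model names in the API docs.
from google import genai

client = genai.Client()  # reads the API key from the environment

video = client.files.upload(file="lecture_week8.mp4")
# Long videos take a short while to process after upload; in practice you poll
# the file state until it is ACTIVE before sending the request (omitted here).

response = client.models.generate_content(
    model="gemini-2.5-pro",  # placeholder model name
    contents=[
        video,
        "Summarise this lecture in ten bullet points with approximate timestamps.",
    ],
)

print(response.text)
```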

🔬 The Model Landscape

No single model leads on all four modalities. Understanding which model handles which input — and where native capability ends and workarounds begin — prevents both over-reliance and missed opportunity.

| Model | Developer | Images | Audio | Video | Documents | Context Window | Key Strength for Research |
|---|---|---|---|---|---|---|---|
| Claude (family) | Anthropic | ✓ (native) | ✗ (no native) | ✗ (no native) | ✓ (native PDF) | 200K tokens | Deep document reasoning, code generation alongside analysis |
| GPT (family) | OpenAI | ✓ (native) | ✓ (end-to-end) | ✓ (frames + audio) | ✓ (PDF input) | 128K tokens | Best general-purpose; very low audio latency in real-time tier; strong mixed image + text |
| Gemini (Pro tier) | Google DeepMind | ✓ (native) | ✓ (native) | ✓ (native, ~1 hr standard) | ✓ (native PDF) | 1M tokens | Best for long video/audio; ~1 hr video (standard) or ~8.5 hrs audio-only in one call (durations depend on FPS and resolution; check current API docs) |
| Whisper large-v3 | OpenAI | ✗ | ✓ (transcription only) | ✗ | ✗ | N/A | Open-source audio transcription; ~2.0% WER on clean audio |

🔔 Audio and Video for Claude Users

Claude does not currently process audio or video natively. If you are working primarily in Claude, the recommended workflow is to combine it with a transcription tool first: use Whisper large-v3 or OpenAI's transcription endpoint to convert audio or video to text, then bring that text into Claude for analysis. Alternatively, use the Gemini API for the audio/video step, then pass the transcript or summary to Claude for deeper reasoning or code generation. The tools complement each other rather than compete.
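
A minimal sketch of that two-step workflow, assuming the openai-whisper package and the Anthropic Python SDK (file and model names are placeholders), looks like this:

```python
# Minimal sketch of the transcribe-then-analyse workflow described above:
# Whisper produces the transcript, Claude does the reasoning step.
# File and model names are placeholders.
import whisper
import anthropic

transcript = whisper.load_model("large-v3").transcribe("focus_group_02.wav")["text"]

client = anthropic.Anthropic()
message = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder model name
    max_tokens=1500,
    messages=[{
        "role": "user",
        "content": (
            "Below is a focus-group transcript produced by automatic transcription, "
            "so it may contain errors.\n\n" + transcript + "\n\n"
            "List the main themes, each with one or two short supporting quotes. "
            "Quote the transcript verbatim so I can verify against the recording."
        ),
    }],
)

print(message.content[0].text)
```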

📍 Research Use-Case Map

Different research domains have different primary modalities. This table maps common UCT research contexts to the modality and tooling most relevant to them. It is a starting point — not every cell will match your specific project.

| Research Domain | Key Modality | Example Use | Best Tool |
|---|---|---|---|
| Qualitative (interviews / focus groups) | Audio | Transcription and thematic analysis | Whisper + ATLAS.ti / NVivo |
| Quantitative (survey data in PDFs) | Documents | Table extraction from supplementary files | Docling / LlamaParse |
| Scientific publishing | Images | Figure interpretation, alt text generation | Claude (family) / GPT (family) |
| Field research (Africa, low-resource) | Audio | Transcription in African languages | Intron Sahara / Lelapa AI |
| Archival research | Documents | Scanned document OCR | Marker / Azure Doc Intelligence |
| Earth / environmental science | Images | Satellite imagery analysis | TerraTorch + Prithvi / Gemini Geospatial |
| Medical / health sciences | Images | Assisting with image description (not diagnosis) | Frontier VLM (with verification) |
| Education research | Video | Classroom observation analysis | Gemini (family) / ClassMind |

🧠 Reading vs. Understanding — The Core Distinction

This is the most important conceptual framing of the week. It applies to every modality and every tool. Understanding it before using any multimodal AI is not optional — it is the difference between using these tools safely and being misled by them.

The Reading – Understanding Gap

When an AI model “reads” an image or document, it is pattern-matching against its training data. When it “understands” that image or document, it would be reasoning about what the content means in context. Current models are genuinely impressive at the former — and frequently overconfident about the latter.

As noted in the overview above, a model can describe a chart fluently and still get the numbers wrong, transcribe speech accurately and still hallucinate sentences that were never said, and extract text from a PDF and still misread the structure of a complex table.

The distinction between reading and understanding is not philosophical — it is the practical difference between a tool you can trust and one that will quietly mislead you.

📊 Case Study: The Real-World Chart Gap

The CharXiv benchmark (NeurIPS 2024, Princeton University) tests AI on real scientific charts from actual published papers — not simplified test datasets designed for evaluation convenience. It has become the standard reference for evaluating genuine scientific chart understanding.

When the benchmark was published in 2024, GPT-4o scored 47.1% on reasoning questions versus 80.5% for humans — a gap that received considerable attention. Frontier model scores have risen substantially since (top models now approach human performance on the original benchmark). But the core finding persists in evaluations of newer models: real-world performance on actual scientific figures consistently lags performance on simplified, purpose-built benchmarks. The headline gap has narrowed; the underlying pattern has not gone away.

One mechanism: when models do struggle, they often appear to read labels, axis titles, and captions — the text surrounding the chart — rather than actually processing the chart geometry. The language decoder generates plausible-sounding descriptions that drift from what is visually present in the image. Frontier models have improved on this, but the failure mode is not fully solved.

This is the week's central finding. Each sub-lesson shows a version of it in a different modality: charts, documents, audio. The benchmark-vs-real-world gap is not a chart-specific failure. It is a general property of current multimodal AI that you need to build into your research workflows.

Source: Wang et al. (2024). “CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs.” NeurIPS 2024. arxiv.org/abs/2406.18521

⚡ A Note on Rapid Change

🕑 These Capabilities Are a Moving Target

Multimodal capabilities are advancing faster than any other area we cover in this course. Model versions change quarterly. Capability comparisons that are accurate today may not be accurate in six months. A model announced after this page was written may outperform every entry in the table above on one or more dimensions.

The skill this week is not memorising which model does what — it is developing the habit of testing claims about multimodal capability against your own research tasks before trusting them.

CharXiv is a good example: when the benchmark launched in 2024, leading models that scored 90%+ on standard chart benchmarks scored far lower on real scientific charts. Frontier models have since improved, but the underlying lesson is durable — standard benchmarks rarely capture what matters for research. When you read claims about AI performance on a new multimodal task, the first question should always be: “What was it actually tested on, and how close is that to my data?”
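
One way to make that habit concrete is a small spot-check script you run whenever a model reads values off your own figures. The sketch below is purely illustrative (the numbers are invented placeholders); the point is the workflow of comparing model-extracted values against source data you already hold.

```python
# Minimal sketch of the habit described above: before trusting a model on chart
# reading, spot-check it on a few figures where you already know the numbers.
# All values here are invented placeholders, not real benchmark data.

def spot_check(model_readings: dict, ground_truth: dict, rel_tol: float = 0.05) -> None:
    """Compare model-extracted values against known values from the source data."""
    for key, true_value in ground_truth.items():
        read_value = model_readings.get(key)
        if read_value is None:
            print(f"{key}: MISSING from model output")
            continue
        error = abs(read_value - true_value) / abs(true_value)
        status = "OK" if error <= rel_tol else "FAIL"
        print(f"{key}: model={read_value}, truth={true_value}, error={error:.1%} -> {status}")

# Values the model claimed to read off a figure, vs. your own source data file.
spot_check(
    model_readings={"treatment_mean": 4.8, "control_mean": 3.1, "n_participants": 210},
    ground_truth={"treatment_mean": 4.8, "control_mean": 3.4, "n_participants": 210},
)
```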

📚 Core Readings

Wang et al. (2024)

CharXiv: Charting Gaps in Realistic Chart Understanding in Multimodal LLMs

NeurIPS 2024 (Princeton University). The benchmark study that established the 47% finding. Essential reading for anyone planning to use AI with scientific figures, charts, or data visualisations. Includes a detailed analysis of why models fail — not just the benchmark scores.

  • 2,323 charts from real published papers
  • Descriptive vs. reasoning question split
  • Human baseline comparison
  • Root-cause analysis of failure modes

arxiv.org/abs/2406.18521 ↗

Rahmanzadehgervi et al. (2024)

Vision Language Models Are Blind

ACCV 2024. Tests models on tasks trivial for humans: overlapping circles, intersecting lines, counting objects in simple arrangements. Models average 58.07% on these tasks — barely better than random on some sub-tasks. Provides a mechanistic explanation for the CharXiv result: models are not seeing the geometry.

  • 7 simple geometric task categories
  • 4 state-of-the-art VLMs evaluated
  • Failure mode taxonomy
  • Implications for scientific image use

arxiv.org/abs/2407.06581 ↗

✅ Summary and What's Next

Week 8 at a Glance

This sub-lesson introduced the four modalities relevant to research — images, audio, documents, and video — and mapped them to the tools and research contexts where they matter most. The model landscape table gives you a working comparison of where the major tools are today; treat it as a snapshot, not a permanent reference.

The central conceptual frame for the week is the reading vs. understanding distinction. The CharXiv benchmark provides the most rigorous available evidence for why this matters: even as model scores have improved, real-world performance on actual scientific figures consistently lags performance on purpose-built evaluation benchmarks. The companion “VLMs Are Blind” paper explains one key mechanism: models often read surrounding text labels rather than processing image geometry.

The remaining sub-lessons go deeper into each domain:

  • Sub-Lesson 2: AI and Scientific Images — the CharXiv finding in depth, the correct-answer-wrong-reasoning problem, domain-specific tools, and bias in image recognition
  • Sub-Lesson 3: Document Intelligence — OCR, table extraction, PDF structure, and where document AI genuinely excels vs. fails
  • Sub-Lesson 4: Transcription and Audio Analysis — Whisper performance, African language support, speaker diarisation, and hallucination in audio transcription
  • Sub-Lesson 5: Video and Multimodal Workflows — Gemini's long-context video capability, temporal reasoning, and practical workflows for lecture and field recording analysis
  • Sub-Lesson 6: Hands-On Activities and Assessment — three practical exercises (figure analysis, self-recorded transcription test, document table extraction) and the weekly assessment

Each sub-lesson includes a practical workflow section — concrete steps for using that modality reliably in your own research, including how to verify outputs before trusting them.